This exercise is designed to determine hypotheses of phylogenetic relationships between major biogeographic regions by means of assessing taxonomic relationships between all taxa known from these mountains. Taxa differ with regards to being described to Genera, species, and subspecies, and the hierarchical effects of these relationships sheds light on how similar populations are between different mountain ranges.
Here, we use kmeans clustering and hclust hierarchical clustering. I determine ideal group size of kmeans using the gap-statistic, and the ideal size for hclust using the more qualitative elbow method, with guidance from the kmeans group size.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-7
library(ape)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(dendextend)
## Registered S3 method overwritten by 'dendextend':
## method from
## rev.hclust vegan
##
## ---------------------
## Welcome to dendextend version 1.15.1
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
## Attaching package: 'dendextend'
## The following objects are masked from 'package:ape':
##
## ladderize, rotate
## The following object is masked from 'package:permute':
##
## shuffle
## The following object is masked from 'package:stats':
##
## cutree
x=as.data.frame(read_csv(paste0(filepath,"TableS1_v2.csv")))
## Rows: 732 Columns: 51
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): From Bowie, From Dowsett, Exclude, Genus, Superspecies, Species, G...
## dbl (43): Clements, Bioko, Mt. Cameroon, Cameroon Highlands, Bamenda & Adama...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(x)
## From Bowie From Dowsett Exclude Clements
## Length:732 Length:732 Length:732 Min. : 329
## Class :character Class :character Class :character 1st Qu.:21948
## Mode :character Mode :character Mode :character Median :23890
## Mean :22811
## 3rd Qu.:28628
## Max. :31354
## Genus Superspecies Species Group
## Length:732 Length:732 Length:732 Length:732
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Subspecies Bioko Mt. Cameroon Cameroon Highlands
## Length:732 Min. :0.00000 Min. :0.00000 Min. :0.00000
## Class :character 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Mode :character Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.04918 Mean :0.06831 Mean :0.07104
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000
## Bamenda & Adamawa Lendu West Rift Rwenzori
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.08743 Mean :0.09426 Mean :0.1803 Mean :0.1339
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## East Rift Kabobo West Ethiopia East Ethiopia
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.000
## Mean :0.1639 Mean :0.09016 Mean :0.1407 Mean :0.127
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.000
## S Eth-N Ken Imatong Elgon West Kenya
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.04645 Mean :0.08607 Mean :0.1298 Mean :0.1475
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## Kenya-Aberdare Ngorongoro Meru Kilimanjaro
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.000
## Mean :0.1448 Mean :0.1148 Mean :0.1148 Mean :0.112
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.000
## Taita Pare Usambara Nguu
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.05874 Mean :0.08333 Mean :0.1107 Mean :0.05601
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.00000
## Nguru Ukaguru Rubeho Uluguru
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.06284 Mean :0.0724 Mean :0.06967 Mean :0.1052
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## Udzungwa Southern Highlands Nyika Kaningina
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.000 Median :0.0000 Median :0.00000
## Mean :0.1202 Mean :0.112 Mean :0.1107 Mean :0.08197
## 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.00000
## Dedza-Salima Zomba Thyolo Mulanje
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.06557 Mean :0.06831 Mean :0.06831 Mean :0.06694
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## Namuli Gorongosa Chimanimani N Drakensberg
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.05601 Mean :0.04508 Mean :0.07377 Mean :0.0806
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## S Drakensberg Cape Fold Mountains Angola
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.0929 Mean :0.06694 Mean :0.0806
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000
x[is.na(x)]=0
colSums(x[,-c(1:9)])
## Bioko Mt. Cameroon Cameroon Highlands Bamenda & Adamawa
## 36 50 52 64
## Lendu West Rift Rwenzori East Rift
## 69 132 98 120
## Kabobo West Ethiopia East Ethiopia S Eth-N Ken
## 66 103 93 34
## Imatong Elgon West Kenya Kenya-Aberdare
## 63 95 108 106
## Ngorongoro Meru Kilimanjaro Taita
## 84 84 82 43
## Pare Usambara Nguu Nguru
## 61 81 41 46
## Ukaguru Rubeho Uluguru Udzungwa
## 53 51 77 88
## Southern Highlands Nyika Kaningina Dedza-Salima
## 82 81 60 48
## Zomba Thyolo Mulanje Namuli
## 50 50 49 41
## Gorongosa Chimanimani N Drakensberg S Drakensberg
## 33 54 59 68
## Cape Fold Mountains Angola
## 49 59
colnames(x)
## [1] "From Bowie" "From Dowsett" "Exclude"
## [4] "Clements" "Genus" "Superspecies"
## [7] "Species" "Group" "Subspecies"
## [10] "Bioko" "Mt. Cameroon" "Cameroon Highlands"
## [13] "Bamenda & Adamawa" "Lendu" "West Rift"
## [16] "Rwenzori" "East Rift" "Kabobo"
## [19] "West Ethiopia" "East Ethiopia" "S Eth-N Ken"
## [22] "Imatong" "Elgon" "West Kenya"
## [25] "Kenya-Aberdare" "Ngorongoro" "Meru"
## [28] "Kilimanjaro" "Taita" "Pare"
## [31] "Usambara" "Nguu" "Nguru"
## [34] "Ukaguru" "Rubeho" "Uluguru"
## [37] "Udzungwa" "Southern Highlands" "Nyika"
## [40] "Kaningina" "Dedza-Salima" "Zomba"
## [43] "Thyolo" "Mulanje" "Namuli"
## [46] "Gorongosa" "Chimanimani" "N Drakensberg"
## [49] "S Drakensberg" "Cape Fold Mountains" "Angola"
x2=x
# reformat names to be hierarchical
x2$Superspecies=paste(x2$Genus,x2$Superspecies)
x2$Species=paste(x2$Superspecies,x2$Species)
x2$Group=paste(x2$Species,x2$Group)
x2$Subspecies=paste(x2$Group,x2$Subspecies)
# test variables for coding
level="Genus"
ncluster=9
hcluster=9
xdata=x2
author="Cooper"
# removed ncluster variable
# now determines best group number
clustertaxa=function(level,ncluster=NULL,hcluster=NULL,xdata,author){
if(is.null(ncluster)==T){ncluster=5}
if(is.null(hcluster)==T){hcluster=5}
x3=xdata %>%
filter(Exclude!="Exclude") %>%
select(-`From Bowie`,-`From Dowsett`,-Exclude,-Clements)
xnames=x3[,which(colnames(x3)==level)]
x4=x3%>%select(-Superspecies,-Genus,
-Species,-Group,-Subspecies)
col.x=colnames(x4)
x4=as.data.frame(unclass(t(x4)))
colnames(x4)=xnames
for(i in 1:ncol(x4)){
x4[,i]=as.numeric(as.character(x4[,i]))
}
u.names=unique(xnames)
for(i in 1:length(u.names)){
target=u.names[i]
if(sum(colnames(x4)==target)<=1){
index=which(colnames(x4)==target)
nu.x=x4[,c(index,index)]
if(i==1){
x6=nu.x
}else{
x6=cbind(x6,nu.x[,1])
}
if(i==2){
x6=x6[,-2]
}
}
if(sum(colnames(x4)==target)>1){
xx=x4[,which(colnames(x4)==target)]
nu.x=rowSums(xx)
nu.x[nu.x>1]=1
if(i==1){
x6=nu.x
}else{
x6=cbind(x6,nu.x)
}
if(i==2){
x6=x6[,-2]
}
}
}
colnames(x6)=u.names
# manual plot for kmeans
#wss=(nrow(x6)-1)*sum(apply(x6,2,var))
#for(v in 2:40){
# wss[v]=sum(kmeans(x6,centers=v)$withinss)
#}
#plot(1:40,wss,type="b",xlab="Number of Clusters",
# ylab="Within groups sum of squares")
set.seed(123)
# ncluster.det=which(wss==min(wss))
# defined from above plot
clust.x=hclust(dist(x6),method="average")
wss=(nrow(x6)-1)*sum(apply(x6,2,var))
x.cut=cutree(clust.x,2:40)
for(v in 2:30){ # reducing number to be plotted
x.ssq=aggregate(x6,by=list(x.cut[,v]),function(x){sum(scale(x,scale=F)^2)})
ssq=rowSums(x.ssq[,-1])
TSS=sum(x.ssq[,-1])
wss[v]=TSS
}
# use hcluster here as a comparison to the kmeans clusters
plot(x=1:30,y=wss,pch=19,type='b',xlab="Number of Clusters",
ylab="Within groups sum of squares",main="H-Clust")
abline(v=hcluster)
plot(clust.x,main="Elbow Clust")
rect.hclust(clust.x,k=hcluster)
plot(clust.x,main="#K Clust")
rect.hclust(clust.x,k=ncluster)
x.tree=as.phylo(clust.x)
write.tree(x.tree,file=paste0(filepath,"hclust_",level,"_",author,".tre"))
#for(k in 1:nclusters){
# xclust=kmeans(x6,nclusters[k],nstart=20)
# return(xclust)
#}
print(fviz_nbclust(x6,kmeans,nstart=2,method="gap_stat",
nboot=100,k.max=30)+
labs(subtitle = "Kmeans: Gap Statistic"))
xclust=kmeans(x=x6,centers=ncluster)
assignments=as.data.frame(xclust$cluster)
write.csv(assignments,
paste0(filepath,"clusters_",level,
"_K",ncluster,"_",author,".csv"),
row.names = T,quote = F)
}
# prepare variables for tests
level="Superspecies"
#nclusters=5
# ensure we are excluding problem
xdata=x2 %>% filter(Exclude==0)
bowie.data=xdata%>%filter(`From Bowie`!=0)
dowsett.data=xdata%>%filter(`From Dowsett`!=0)
# my data
clustertaxa(level="Genus",xdata=xdata,
ncluster=10,hcluster=6,
author="Cooper")
# Bowie data
clustertaxa(level="Genus",xdata=bowie.data,
ncluster=4,hcluster=7,
author="Bowie")
# Dowsett data
clustertaxa(level="Genus",xdata=dowsett.data,
ncluster=4,hcluster=8,
author="Dowsett")
# this study
clustertaxa(level="Superspecies",xdata=xdata,
ncluster=10,hcluster=7,
author="Cooper")
# Bowie data
clustertaxa(level="Superspecies",xdata=bowie.data,
ncluster=20,hcluster=5,
author="Bowie")
clustertaxa(level="Superspecies",xdata=dowsett.data,
ncluster=13,hcluster=4,
author="Dowsett")
clustertaxa(level="Species",xdata=xdata,
ncluster=10,hcluster=4,
author="Cooper")
clustertaxa(level="Species",xdata=bowie.data,
ncluster=10,hcluster=4,
author="Bowie")
clustertaxa(level="Species",xdata=dowsett.data,
ncluster=5,hcluster=5,
author="Dowsett")
clustertaxa(level="Group",xdata=xdata,
ncluster=16,hcluster=4,
author="Cooper")
clustertaxa(level="Group",xdata=bowie.data,
ncluster=3,hcluster=6,
author="Bowie")
clustertaxa(level="Group",xdata=dowsett.data,
ncluster=10,hcluster=5,
author="Dowsett")
clustertaxa(level="Subspecies",xdata=xdata,
ncluster=8,hcluster=6,
author="Cooper")
clustertaxa(level="Subspecies",xdata=bowie.data,
ncluster=11,hcluster=8,
author="Bowie")
clustertaxa(level="Subspecies",xdata=dowsett.data,
ncluster=12,hcluster=5,
author="Dowsett")
The following is the “ideal” number of groups (partitions) within each dendrogram for each taxonomic treatise.
| Source | Genus | Superspecies | Species | Group | Subspecies | \(\mu\) | \(\sigma\) |
|---|---|---|---|---|---|---|---|
| Dowsett (1986) | 4 | 13 | 5 | 10 | 12 | 8.8 | 4.1 |
| Bowie (2003) | 4 | 20 | 10 | 3 | 11 | 9.6 | 6.8 |
| This study | 10 | 10 | 10 | 16 | 8 | 10.8 | 3.0 |
| Overall \(\mu\) | 6 | 14.3 | 8.3 | 9.6 | 10.3 | 9.73 | |
| Overall \(\sigma\) | 3.5 | 5.1 | 2.9 | 6.5 | 2.1 |
Dowsett=cbind(1:length(Dowsett),"Dowsett",Dowsett)
Bowie=cbind(1:length(Bowie),"Bowie",Bowie)
This.study=cbind(1:length(This.study),"This study",This.study)
colnames(Dowsett)=colnames(Bowie)=colnames(This.study)=c("X","Study","N.Clusters")
dat.x=data.frame(rbind(Dowsett,Bowie,This.study))
a=ggplot(dat.x,aes(x=X,y=N.Clusters,group=Study))
b=geom_line(aes(linetype=Study))
b2=geom_point()
c=theme_classic()
d=ylim(0,22)
e=xlab('Level')
e2=ylab("# of Clusters")
e3=scale_x_discrete(breaks=c(1,2,3,4,5),labels=c("Genus","Superspecies","Species",
"Group","Subspecies"))
plot1=a+b+c+d+b2+e+e2+e3
print(plot1)
ggsave(plot1,filename = paste0(filepath,"number_groups.jpg"),dpi = 400)
## Saving 7 x 5 in image
Note That this is a subjective measure.
| Source | Genus | Superspecies | Species | Group | Subspecies | \(\mu\) | \(\sigma\) |
|---|---|---|---|---|---|---|---|
| Dowsett (1986) | 8 | 4 | 5 | 5 | 5 | 5.4 | 1.5 |
| Bowie (2003) | 7 | 5 | 4 | 6 | 8 | 6 | 1.6 |
| This study | 9 | 7 | 4 | 4 | 6 | 6 | 2.1 |
| Overall \(\mu\) | 8.0 | 5.3 | 4,3 | 5.0 | 6.3 | 5.8 | |
| Overall \(\sigma\) | 1.0 | 1.5 | 0.6 | 1.0 | 1.5 |
a=ggplot(dat.x,aes(x=X,y=N.Clusters,group=Study))
b=geom_line(aes(linetype=Study))
b2=geom_point()
c=theme_classic()
d=ylim(0,22)
e=xlab('Level')
e2=ylab("# of Clusters")
e3=scale_x_discrete(breaks=c(1,2,3,4,5),labels=c("Genus","Superspecies","Species",
"Group","Subspecies"))
plot1=a+b+c+d+b2+e+e2+e3
print(plot1)
ggsave(plot1,filename = paste0(filepath,"number_groups.jpg"),dpi = 400)
## Saving 7 x 5 in image
Reported as \(kmeans - hclust\), as the gap-statistic is a less fallible measure. Thus, positive numbers have higher numbers of clusters for kmeans and negative values have higher numbers of clusters for the elbow method. The difference between these values can fluctuate greatly between iterations of the pipeline, especially with regards to the ideal number of K-means clusters.
| Source | Genus | Superspecies | Species | Group | Subspecies |
|---|---|---|---|---|---|
| Dowsett (1986) | -4 | 9 | 0 | 5 | 7 |
| Bowie (2003) | -3 | 15 | 6 | -3 | 3 |
| This study | 1 | 3 | 6 | 12 | 2 |